[% setvar title Abstract Internals String Interaction %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Abstract Internals String Interaction</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Simon Cozens &lt;<a href='mailto:simon@brecon.co.uk'>simon@brecon.co.uk</a>&gt;
  Date: 26 Sep 2000
  Mailing List: <a href='mailto:perl6-internals-unicode@perl.org'>perl6-internals-unicode@perl.org</a>
  Number: 322
  Version: 1
  Status: Developing</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>All accesses to strings and portions of strings inside the Perl 6 core
should be abstracted.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>Although RFC 294 states that data will be in UTF8 internally, there's
going to come a time when we'll need to move to UTF16 or other
encodings. If we start fiddling around &quot;knowing&quot; that we're dealing with
UTF8 data, we're going to make it hard on ourselves when it comes to
migrate.</p>
<p>But not only that, UTF8 is really rather gammy to deal with. It's a lot
nicer if we can just treat each string as an array of wide characters.</p>
<p>Take, for instance, the operation of changing a character U+2654 (WHITE
CHESS KING) into U+003F (QUESTION MARK) in the 3rd position of a 12
character string. Because UTF8 is a variable-width character set, we
have a big problem: U+2654 is 3 bytes wide, and U+003F is one byte, we
have to shift the rest of the string down two bytes, effectively having
to rewrite it.</p>
<p>If we used a function or macro to change this single character, it would
also have to rewrite the whole string; using this several times would be
horrifically inefficient.</p>
<p>Even if we &quot;pretend&quot; that we're dealing with an array of wide chars, and
used macros or functions to allow us to keep up this pretence, we'd
still have to do that shifting around, which is nasty. The only way to
get it working without the inefficiency would be to <b>really</b> deal with
an array of wide chars, and have functions to convert such an array from
and to our character set.</p>
<p>The obvious question would be, then, why bother with RFC 294 or UTF8 at
all? Why not just store these arrays along with our SVs? Several
reasons, notably:</p>
<ul>
<li><a name='Memory overhead'></a>Memory overhead</li>
<p>Most of the world's data is still eight-bit; it's probably still even
seven bit. Storing it all as, effectively, raw Unicode is wasteful.</p>
<li><a name='I/O Convenience'></a>I/O Convenience</li>
<p>We can't get an array of wide chars from the user, but we can get UTF8
from them. It's also likely we'll want to output data in a UTF, at least
at times.</p>
</ul>
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<p>The following interface is proposed:</p>
<pre>    wchar_t *  Perl_utf_to_raw( U8* string,      STRLEN* len )
    U8 *       Perl_raw_to_utf( wchar_t* string, STRLEN* len )</pre>
<p>All functions which manipulate strings should convert them to raw form
and operate on the array, converting it back when it's time to store it
in the string.</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>RFC 294: Internally, data is stored as UTF8</p>
<p>As-yet-unsubmitted RFC : When UTF8 Leaks Out</p>
</div>
